
bloom filter bulk inserts and queries #723


Open
dcoutts wants to merge 9 commits into main from dcoutts/bloomfilter-blocked

Conversation

@dcoutts (Collaborator) commented May 15, 2025

Two changes, both for bulk operations on bloom filters.

For Bloom filter inserts in run accumulation: instead of inserting keys into the bloom filter one by one as they are added to the run accumulator, save them up and add them all in one go when the page is being finalised. This lets us use a bloom filter bulk insert, which in turn lets us use memory prefetching.
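
In outline, the insert side looks like the sketch below. This is a minimal sketch, not the PR's actual code: the accumulator shape is illustrative, and `insertMany` is an assumed bulk-insert operation (the module path is borrowed from the classic bloomfilter package).

{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST)
import qualified Data.BloomFilter.Mutable as Bloom  -- stand-in module path

-- Illustrative accumulator: one mutable filter per run, plus the keys
-- buffered since the last page flush.
data RunAcc s k = RunAcc
  { raFilter :: Bloom.MBloom s k
  , raKeys   :: [k]
  }

-- Buffer the key instead of inserting it into the filter immediately.
addKey :: k -> RunAcc s k -> RunAcc s k
addKey k acc = acc { raKeys = k : raKeys acc }

-- On page finalisation, insert all buffered keys in one go. A bulk
-- insert can prefetch the filter bits for later keys while it is
-- still setting the bits for the current one, hiding cache misses.
finalisePage :: RunAcc s k -> ST s (RunAcc s k)
finalisePage acc = do
  Bloom.insertMany (raFilter acc) (raKeys acc)  -- hypothetical bulk insert
  pure acc { raKeys = [] }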

For Bloom filter queries in key lookups, update the existing bulk query code to properly take advantage of the new API and use prefetching. We can now also simplify and use a single high performance implementation, rather than needing two (a more compatible one and a faster one that relied on fancier features available in later GHC versions).

Results for the WP8 benchmark (100M elements, 10 bits per key, full caching, 10k batches of 256 keys):

  • baseline: 88,871 ops/sec
  • with bulk inserts: 97,152 ops/sec, ~9% improvement
  • with bulk queries: 103,005 ops/sec, ~6% additional, ~16% cumulative improvement

So overall about a 16% improvement in ops/sec on the primary WP8 benchmark, and as a bonus, getting over the magic 100k ops/sec threshold (on my laptop).

@jorisdral (Collaborator) left a comment

Some initial comments. I have yet to look at the bulk insert code

@jorisdral (Collaborator) left a comment

LGTM! If we fix the test failure that I noted in my previous review, then we're good to go

@dcoutts (Collaborator, Author) commented May 15, 2025

CI failure due to flaky test fixed in #724.

@dcoutts force-pushed the dcoutts/bloomfilter-blocked branch 3 times, most recently from 09afe55 to 49f8071 on May 16, 2025 00:06
@dcoutts changed the title from "Bulk insert for bloom filters in RunAcc" to "bloom filter bulk inserts and queries" on May 16, 2025
@jorisdral (Collaborator) left a comment

LGTM with the new bulk query changes

Comment on lines 139 to 145
loop2_prefetch :: KeyIx -> ST s ()
loop2_prefetch !kix
| kix == ksN = pure ()
| otherwise = do
let !keyhash = P.indexPrimArray keyhashes kix
Bloom.prefetchElem filter keyhash
loop2_prefetch (kix+1)
@jorisdral (Collaborator) commented:

The commit message for 8eaf7ce says that the prefetch distance is the number of runs, but this looks like the prefetch distance is the number of keys

@dcoutts (Collaborator, Author) commented:

Yes, oops. I kept changing my mind. I did both versions and benchmarked them.

@dcoutts (Collaborator, Author) commented May 20, 2025:

Updated the commit message.

So in the benchmarks I did, nesting the loops this way round was slightly faster. But arguably, the other way round should have more stable prefetching behaviour, since it would depend on the number of runs and not the lookup batch size.

Switching it round is not that hard. I could still do that. Thoughts?

@dcoutts (Collaborator, Author) commented:

I've switched it round.

dcoutts and others added 3 commits May 20, 2025 10:07
Instead of inserting keys into the bloom filter one by one as they are
added to the run accumulator, save them up and add them all in one go
when the page is being finalised.

This then lets us use a bloom filter bulk insert, which lets us use
memory prefetching.

The result should be faster.

Fetch into the caches in the least intrusive way, locality level "0",
rather than level "3". This does not appear to slow down inserts, and
should evict fewer things from the caches.

And document what level "0" means and why we use it.
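
For reference, GHC exposes the locality level directly in its prefetch primops, prefetchByteArray0# up to prefetchByteArray3#. The primops below are real; the wrapper is a minimal sketch of a level-"0" prefetch:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int (I#), prefetchByteArray0#)
import GHC.ST (ST (ST))
import Data.Primitive.ByteArray (ByteArray (ByteArray))

-- Hint the CPU to fetch the byte at the given offset with locality
-- level "0", the least intrusive level, so the prefetched line
-- evicts as little as possible from the caches. Level "3"
-- (prefetchByteArray3#) would instead ask for the line to be kept
-- in all cache levels.
prefetchLevel0 :: ByteArray -> Int -> ST s ()
prefetchLevel0 (ByteArray ba#) (I# off#) =
  ST (\s -> case prefetchByteArray0# ba# off# s of
              s' -> (# s', () #))
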
`bloomQueriesModel` was not really a proper model, because the model itself was
using actual bloom filters. The model is now instead a `Set`, and
`prop_bloomQueriesModel` is updated because the model will now only return true
positives and negatives.
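
As a sketch, with assumed names (the real property lives in the test suite):

import qualified Data.Set as Set

-- The model of a run's bloom filter is simply the set of its keys.
type ModelFilter k = Set.Set k

-- Model bulk query: report (filter index, key index) for every hit.
-- Unlike a real bloom filter, a Set yields only true positives and
-- true negatives, so the property can require only that the real
-- query results are a superset of the model's results.
modelQueries :: Ord k => [ModelFilter k] -> [k] -> [(Int, Int)]
modelQueries filters keys =
  [ (fix, kix)
  | (fix, f) <- zip [0 ..] filters
  , (kix, k) <- zip [0 ..] keys
  , k `Set.member` f
  ]
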
@dcoutts force-pushed the dcoutts/bloomfilter-blocked branch from 49f8071 to 6dac4ec on May 20, 2025 09:13
dcoutts added 5 commits May 20, 2025 10:45
Previously, for the Classic Bloom filter implementation we had two
different implementations of bloomQueries: one that was relatively
simple and didn't rely on anything fancy, and one that went all out to
maximise performance.

The high performance one had to be disabled when we added the
block-structured bloom filter implementation, since it was tightly
coupled to that implementation.

With the new bloom filter API and implementation, we can now implement a
single high performance version of bulk query. We no longer need
separate higher and lower performance versions, since we no longer need
to rely on fancy features like unlifted boxed arrays.

So strip out the bloom-query-fast cabal flag and the BloomFilterQuery2
module.

The updated BloomFilterQuery1 does a simple nested loop: iterating over
all filters and within that over all keys. It does two loops over the
keys: one to prefetch the key for the filter, and a second one to do the
lookups for real. Thus the prefetch distance is the number of keys,
which is of course somewhat variable. In benchmarks this version was
slightly faster than nesting the loops the other way around, but perhaps
the other way around would have more stable prefetching behaviour over a
wider range of key batch sizes. Changing this would not be too hard.

The query module used to be big, and there used to be two of them. Now
there's only one and it's a lot smaller. So it makes sense to keep it
all together in one module.

We were using a mix of "import qualified ... as BF" and "import ... as Bloom".

@dcoutts force-pushed the dcoutts/bloomfilter-blocked branch from 6dac4ec to ef64262 on May 20, 2025 12:19
This means our prefetching is done with a prefetch distance of the
number of runs, which is more stable than the key batch size.

Also tweak the code so we get only joinrec and not letrec in the core.
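
A sketch of the switched shape: keys in the outer loop, two inner passes over the runs. Bloom.prefetchElem and the PrimArray accessors appear in the snippet quoted earlier; Bloom.elemHash, the KeyHash representation, and the list-shaped result are illustrative assumptions, not the PR's actual code.

{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST)
import Data.Word (Word64)
import qualified Data.Vector as V
import qualified Data.Primitive as P
import qualified Data.BloomFilter as Bloom  -- stand-in module path

type RunIx   = Int
type KeyIx   = Int
type KeyHash = Word64  -- assumed representation

bloomQueriesSketch
  :: V.Vector (Bloom.Bloom k)  -- one filter per run
  -> P.PrimArray KeyHash       -- hashes of the looked-up keys
  -> ST s [(RunIx, KeyIx)]
bloomQueriesSketch filters keyhashes = loopKeys 0
  where
    !ksN = P.sizeofPrimArray keyhashes

    loopKeys !kix
      | kix == ksN = pure []
      | otherwise  = do
          let !keyhash = P.indexPrimArray keyhashes kix
          -- Inner pass 1: prefetch this key's slot in every filter, so
          -- each prefetch runs (number of runs) lookups ahead of its
          -- real memory access, whatever the lookup batch size.
          V.forM_ filters (\f -> Bloom.prefetchElem f keyhash)
          -- Inner pass 2: the lookups for real.
          let hits = [ (fix, kix)
                     | (fix, f) <- zip [0 ..] (V.toList filters)
                     , Bloom.elemHash keyhash f ]  -- assumed pure lookup
          rest <- loopKeys (kix + 1)
          pure (hits ++ rest)
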
@dcoutts force-pushed the dcoutts/bloomfilter-blocked branch from ef64262 to a02e702 on May 20, 2025 12:26